Create Python API for VideoEncoder #990
Conversation
Thanks @Dan-Flores, looks great, made a quick first pass.
frames (``torch.Tensor``): The frames to encode. This must be a 4D
    tensor of shape ``(N, C, H, W)`` where N is the number of frames,
    C is 3 channels (RGB), H is height, and W is width.
    A 3D tensor of shape ``(C, H, W)`` is also accepted as a single RGB frame.
A 3D tensor of shape ``(C, H, W)`` is also accepted as a single RGB frame.
Q - Do we need to support that? I'm wondering if it makes a lot of sense to just encode a single image as a video. I suspect this was made to mimic the AudioEncoder behavior but that was a different use-case. In the AudioEncoder we want to allow for 1D audio to be supported as it's still a valid waveform. But I don't think we need to treat a single frame as a valid video.
I'm not sure what the use case is for encoding an image as a video, but since FFmpeg allows encoding an image to video, I believe we can retain this functionality for a relatively low cost.
I'm not sure what the use case is for encoding an image as a video
I'm not sure either - let's not do it then :). One principle we have is that we should only implement things that we know are useful. In doubt, it's best not to expose it, it makes our life easier as maintainers. Of course if users start to explicitly request this feature, we may reconsider.
In the near future we will start implementing proper image encoders in torchcodec, and another guiding principle is that there should be one-- and preferably only one --obvious way to do it. (ref). If we were to support 3D tensors (images) into the VideoEncoder then we'd potentially expose two ways to do the same thing.
It doesn't mean we'll prevent users from doing it if they really want to: a user can still just call unsqueeze(0) on their 3D tensor and pass it. It will work. But we don't need to explicitly support 3D tensors on our side.
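For illustration, a minimal sketch of the `unsqueeze(0)` workaround mentioned above; the import path and exact constructor signature of `VideoEncoder` are assumptions based on this PR's docstrings, not a confirmed API:

```python
import torch

from torchcodec.encoders import VideoEncoder  # import path assumed from this PR

# A single RGB frame as a 3D (C, H, W) tensor.
single_frame = torch.randint(0, 256, (3, 480, 640), dtype=torch.uint8)

# The encoder only accepts 4D (N, C, H, W) input, so promote the frame to a
# one-frame batch before constructing the encoder.
frames = single_frame.unsqueeze(0)  # shape: (1, 3, 480, 640)

encoder = VideoEncoder(frames=frames, frame_rate=30)  # keyword names taken from the docstrings above
encoder.to_file("single_frame.mp4")
```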
frame_rate (int): The frame rate to use when encoding the
    **input** ``frames``.
My interpretation of this description is that this parameter actually defines the frame rate of the encoded output, because of the "frame rate to use when encoding" part.
I think it might be less ambiguous as:
The frame rate of the input ``frames``. This is not the frame rate of the encoded output.
We use this variable in Encoder.cpp: `avCodecContext_->framerate = {inFrameRate_, 1};`.
Perhaps the naming is confusing (should be inputFrameRate rather than inFrameRate), but is it not accurate to say this defines the frame rate of the encoded output?
Good point, you're right, frame_rate and inFrameRate currently define both the frame rate of the input and the frame rate of the encoded output.
But in a future implementation, we might let the user choose to encode the output with a different frame rate. If/when that happens, this frame_rate parameter will only be about the input frame rate, while the frame_rate parameter that we'll add to the encoding methods (`to_file(frame_rate=...)`) will define the output frame rate. It will be similar to what we do in the AudioEncoder with the sample_rate parameters:
- one is for the input, in the constructor (torchcodec/src/torchcodec/encoders/_audio_encoder.py, lines 18 to 20 in 28b0346):
  sample_rate (int): The sample rate of the **input** ``samples``. The sample rate of the encoded output can be specified using the encoding methods (``to_file``, etc.).
- one is for the output, in the methods (torchcodec/src/torchcodec/encoders/_audio_encoder.py, lines 66 to 67 in 28b0346):
  sample_rate (int, optional): The sample rate of the encoded output. By default, the sample rate of the input ``samples`` is used.
My original docstring suggestion was definitely wrong (thanks for catching!) but I'd still suggest to clarify that this is really the input frame rate. And, for now, it also happens to be what we use for the output frame rate:
The frame rate of the input frames. Also defines the encoded output frame rate.
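For reference, a small sketch of the AudioEncoder pattern described above (input rate in the constructor, optional output rate in the encoding methods); argument names follow the quoted docstrings, but the exact signatures are assumptions:

```python
import torch

from torchcodec.encoders import AudioEncoder

# sample_rate in the constructor describes the *input* samples.
samples = torch.rand(2, 16_000)
encoder = AudioEncoder(samples=samples, sample_rate=16_000)

# The encoding methods accept an optional output sample_rate; by default the
# input sample rate is reused.
encoder.to_file("out.wav", sample_rate=44_100)

# By analogy, VideoEncoder(frames, frame_rate=...) would describe the input,
# and a future to_file(frame_rate=...) could control the output frame rate.
```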
Thanks for clarifying this - I'll make a note to add frame_rate to the encoding methods (`to_file`, etc.) next.
Args:
    dest (str or ``pathlib.Path``): The path to the output file, e.g.
        ``video.mp4``. The extension of the file determines the video
        format and container.
What's the distinction between "format" and "container" here? I would just use one or the other?
The terms are used interchangeably, so I included both here to be understandable to users of either term. Let me know if that is actually more confusing than just naming one term.
I might have added the distinction after discovering the format mkv is defined as a matroska container. I have not encountered other formats where these are different.
I feel like saying
determines the video format and container.
may suggest to some readers that "format" and "container" refer to different things. Especially because a) "format" is sometimes used to refer to the codec, and b) in the methods we only have one single "format" parameter and we never mention the "container".
How about using the term "container format" everywhere in the docstrings? I think that is the actual correct technical term for that (wikipedia):
Notable examples of container formats include [...] formats used for multimedia playback (such as Matroska, MP4, and AVI).
The parameter can still be format - I think this makes sense for consistency with the AudioEncoder, but in the docstrings of all methods we could clarify that format corresponds to the "container format"?
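To make the "container format" wording concrete, here is a hedged sketch of how the extension of `dest` would select the container format; whether the encoding methods expose a `format` argument exactly like this is an assumption:

```python
import torch

from torchcodec.encoders import VideoEncoder  # import path assumed

frames = torch.randint(0, 256, (24, 3, 480, 640), dtype=torch.uint8)
encoder = VideoEncoder(frames=frames, frame_rate=24)

# The container format is inferred from the extension of ``dest``:
encoder.to_file("clip.mp4")  # MP4 container format
encoder.to_file("clip.mkv")  # Matroska container format (FFmpeg's "matroska")

# For destinations without an extension, an explicit ``format`` argument
# (mirroring AudioEncoder's parameter) would name the container format, e.g.:
# encoder.to_file("clip_without_extension", format="mp4")  # hypothetical call
```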
This PR creates a simple `VideoEncoder` class, and updates several tests to utilize the `VideoEncoder.to_file` pattern.
- `test_bad_input_parameterized` is the `VideoEncoder` equivalent to the AudioEncoder test that ensures general error checking occurs, while `test_bad_input` tests method-specific errors.
- `test_contiguity` is the `VideoEncoder` equivalent to the AudioEncoder test to ensure contiguous and non-contiguous tensors can be encoded, and are encoded equivalently.
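As an illustration of the contiguity check described above, a hedged pytest-style sketch; the actual `test_contiguity` in this PR may differ, and the import path and constructor keywords are assumptions:

```python
import torch

from torchcodec.encoders import VideoEncoder  # import path assumed


def test_contiguity_sketch(tmp_path):
    frames = torch.randint(0, 256, (8, 3, 64, 64), dtype=torch.uint8)

    # Same values, but stored in NHWC memory layout and viewed back as NCHW,
    # which makes the tensor non-contiguous.
    non_contiguous = frames.permute(0, 2, 3, 1).contiguous().permute(0, 3, 1, 2)
    assert frames.is_contiguous()
    assert not non_contiguous.is_contiguous()

    out_a = tmp_path / "contiguous.mp4"
    out_b = tmp_path / "non_contiguous.mp4"
    VideoEncoder(frames=frames, frame_rate=30).to_file(out_a)
    VideoEncoder(frames=non_contiguous, frame_rate=30).to_file(out_b)

    # Both layouts should be encoded equivalently; the real test may compare
    # decoded frames rather than raw bytes.
    assert out_a.read_bytes() == out_b.read_bytes()
```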